Which chemical properties influence the quality of red wines?

Austin J. Alexander


Check out the data

show column names
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"
basic stats for each column
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000
check for any NaN values
## [1] 0
check the distribution of quality
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

As can be seen above, the vast majority of wines are scored 5 or 6.

proportions

3’s and 4’s

## [1] 0.03939962

5’s and 6’s

## [1] 0.8248906

7’s and 8’s

## [1] 0.1357098
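
The three proportions above follow directly from the quality counts. Here is a minimal base R sketch that re-enters the counts from the table by hand; the names `quality_counts`, `band`, and `band_props` are illustrative only. With the data loaded, the counts would presumably come from something like `table(reds$quality)`.

```r
# Counts per quality score, copied from the distribution table above
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Assign each score to a band: low (3-4), mid (5-6), high (7-8)
scores <- as.integer(names(quality_counts))
band <- cut(scores, breaks = c(2, 4, 6, 8), labels = c("low", "mid", "high"))

# Sum the counts within each band and divide by the total number of wines
band_props <- tapply(quality_counts, band, sum) / sum(quality_counts)
print(round(band_props, 4))
```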

Scores of 5 and 6 account for over 82% of all scores! This suggests that the most useful information might be found by examining the lowest and highest scorers, but we’ll save that for later.

Univariate Plots Section

visualize quality

Examine correlations (technically these are bivariate plots, but the correlation coefficients are displayed nicely).

Alcohol, volatile.acidity, and sulphates have the correlation coefficients with quality that are furthest from zero, so let’s examine these further.
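
One way to pick out the features whose correlation with quality is furthest from zero is to rank them by absolute correlation. The sketch below assumes Pearson correlation (the default for `cor()`) and an all-numeric data frame. The helper name `rank_by_correlation` is hypothetical and is not the code that produced the plots.

```r
# Rank numeric columns by the absolute value of their Pearson correlation
# with a target column. `df` is any all-numeric data frame
# (for example the wine data, assumed to be loaded as `reds`).
rank_by_correlation <- function(df, target = "quality") {
  predictors <- setdiff(names(df), target)
  # cor() of a data frame against a vector returns a one-column matrix
  r <- cor(df[predictors], df[[target]])[, 1]
  # Largest absolute correlation first; signs are kept in the result
  r[order(abs(r), decreasing = TRUE)]
}

# Usage (assumes `reds` exists): head(rank_by_correlation(reds), 3)
```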

facet alcohol by quality

view alcohol as a histogram colored by quality

facet volatile.acidity by quality

view volatile.acidity as a histogram colored by quality

facet sulphates by quality

view sulphates as a histogram colored by quality
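
The faceted and colored histograms listed above can be sketched with ggplot2 roughly as follows. The function names, the bin count, and the free y-axis are my assumptions, not necessarily the settings behind the original plots.

```r
library(ggplot2)

# Histogram of one feature, split into one panel per quality score.
# `df` is assumed to be the wine data frame; `feature` is a column name.
facet_hist_by_quality <- function(df, feature, bins = 30) {
  ggplot(df, aes(x = .data[[feature]])) +
    geom_histogram(bins = bins) +
    facet_wrap(~ quality, scales = "free_y") +  # free y: score counts differ a lot
    labs(x = feature, y = "count")
}

# The same feature as a single stacked histogram, filled by quality score.
stacked_hist_by_quality <- function(df, feature, bins = 30) {
  ggplot(df, aes(x = .data[[feature]], fill = factor(quality))) +
    geom_histogram(bins = bins) +
    labs(x = feature, y = "count", fill = "quality")
}

# Usage (assumes `reds` exists): facet_hist_by_quality(reds, "alcohol")
```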

Univariate Analysis

What is the structure of your dataset?

There are 1,599 observations of red wines with 12 recorded features for each observation. Some of the features are related to each other (e.g., those related to acidity). Quality is the only categorical (ordinal) feature, although it is stored as a numeric score.

What is/are the main feature(s) of interest in your dataset?

Alcohol, volatile.acidity, and sulphates seem to be the features most highly correlated with quality scores.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

It is unclear at this point which other features will be useful.

Did you create any new variables from existing variables in the dataset?

No, there didn’t seem to be much of a need to create new variables.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

None of the features seemed unusual enough to explore further, and no, I didn’t tidy or adjust the form of the data at this point.


Bivariate Plots Section

Continue to focus on alcohol, volatile.acidity, and sulphates.

alcohol vs (jittered) quality colored by quality

volatile.acidity vs (jittered) quality colored by quality

sulphates vs (jittered) quality colored by quality

boxplot of alcohol colored by quality

boxplot of volatile.acidity colored by quality

boxplot of sulphates colored by quality

alcohol vs volatile.acidity faceted by quality

alcohol vs sulphates faceted by quality

group reds by quality
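library(dplyr)  # group_by(), summarise(), arrange(), n()

# one row per quality score: feature means plus the number of wines
quality_groups <- group_by(reds, quality)
grouped_reds <- summarise(quality_groups,
                          alcohol_mean = mean(alcohol),
                          volatile_acid_mean = mean(volatile.acidity),
                          sulphates_mean = mean(sulphates),
                          n = n())
grouped_reds <- arrange(grouped_reds, quality)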
quality vs alcohol mean

quality vs alcohol with quantile lines

quality vs volatile.acidity mean

quality vs volatile.acidity with quantile lines

quality vs sulphates mean

quality vs sulphates with quantile lines
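
The values behind the quantile lines can be computed per quality score in base R. This sketch assumes the 10th, 50th, and 90th percentiles, since the write-up does not say which quantiles were drawn, and `quantiles_by_quality` is a hypothetical helper name.

```r
# Per-quality quantiles of one feature, one row per quality score.
# `probs` defaults to the 10th/50th/90th percentiles (an assumption).
quantiles_by_quality <- function(df, feature, probs = c(0.1, 0.5, 0.9)) {
  # split() groups the feature values by score; sapply() returns a matrix
  # with one column per score and one row per requested quantile
  q <- sapply(split(df[[feature]], df$quality), quantile, probs = probs)
  out <- t(q)
  data.frame(quality = as.integer(rownames(out)), out, check.names = FALSE)
}

# Usage (assumes `reds` exists): quantiles_by_quality(reds, "alcohol")
```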

alcohol vs volatile.acidity colored by quality

with quality smoothing lines

now only 3’s, 4’s, 7’s, and 8’s

with quality smoothing lines

now only 3’s and 8’s

alcohol vs sulphates

with quality smoothing lines

now only 3’s and 8’s
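
Narrowing the scatterplots to the extreme scores comes down to filtering on quality before plotting. The sketch below is a rough ggplot2 version. The linear smoother (`method = "lm"`) and the helper name `scatter_by_quality` are my assumptions, and the original plots may have used a different smoothing method.

```r
library(ggplot2)

# Scatterplot of two features, colored by quality, with one smoothing line
# per quality score. `keep` restricts the plot to the listed scores.
scatter_by_quality <- function(df, x, y, keep = sort(unique(df$quality))) {
  sub <- df[df$quality %in% keep, ]
  ggplot(sub, aes(x = .data[[x]], y = .data[[y]], colour = factor(quality))) +
    geom_point(alpha = 0.5) +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
    labs(x = x, y = y, colour = "quality")
}

# Usage (assumes `reds` exists), only 3's and 8's:
# scatter_by_quality(reds, "alcohol", "volatile.acidity", keep = c(3, 8))
```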

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Simply put: alcohol, volatile.acidity, and sulphates (particularly the first two) appear to have an effect on the quality scores. Alcohol will be discussed below, but, in general, the lower the volatile acidity, the higher the quality score; the reverse holds for sulphates, where higher levels tend to accompany higher scores.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I didn’t bother looking at the other features because I’m focusing on answering the primary question driving this project.

What was the strongest relationship you found?

Alcohol. Funnily enough, a higher alcohol content tends to go with a higher score.


Multivariate Plots Section

alcohol vs volatile.acidity, colored by alcohol, faceted by quality

alcohol vs sulphates, colored by alcohol, faceted by quality

alcohol vs volatile.acidity, colored by quality, sized by sulphates

alcohol vs volatile.acidity, colored by quality, sized by sulphates, with contour lines showing overall quantity clustering

alcohol vs volatile.acidity with only contour lines colored by quality showing quality clusters
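
A contour-only view like the last plot above can be approximated with 2D density contours drawn separately for each quality score. This is a sketch that assumes ggplot2's `geom_density_2d()`; the original plot may have been built differently, and `contours_by_quality` is a hypothetical name.

```r
library(ggplot2)

# Density contours of alcohol vs volatile.acidity, one set per quality score.
# `df` is assumed to have alcohol, volatile.acidity, and quality columns.
contours_by_quality <- function(df) {
  ggplot(df, aes(x = alcohol, y = volatile.acidity, colour = factor(quality))) +
    geom_density_2d() +
    labs(colour = "quality")
}

# Usage (assumes `reds` exists): contours_by_quality(reds)
```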

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

As my examination continued, I grew more confident in the apparent relationship between quality scores and alcohol, volatile.acidity, and sulphates.

Were there any interesting or surprising interactions between features?

I found it interesting that sulphate levels seem to have a sweet spot when it comes to quality scores.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

attempts at a simple linear regression model
## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = reds)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = reds)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + sulphates, 
##     data = reds)
## m4: lm(formula = quality ~ alcohol + volatile.acidity + sulphates, 
##     data = reds)
## m5: lm(formula = quality ~ alcohol:volatile.acidity:sulphates, data = reds)
## m6: lm(formula = quality ~ alcohol * volatile.acidity * sulphates, 
##     data = reds)
## 
## ===================================================================================================
##                                            m1        m2        m3        m4        m5        m6    
## ---------------------------------------------------------------------------------------------------
## (Intercept)                              1.875***  3.095***  2.611***  2.611***  5.763***   1.285  
##                                         (0.175)   (0.184)   (0.196)   (0.196)   (0.058)    (2.188) 
## alcohol                                  0.361***  0.314***  0.309***  0.309***             0.426* 
##                                         (0.017)   (0.016)   (0.016)   (0.016)              (0.209) 
## volatile.acidity                                  -1.384*** -1.221*** -1.221***             9.044* 
##                                                   (0.095)   (0.097)   (0.097)              (4.030) 
## sulphates                                                    0.679***  0.679***             2.713  
##                                                             (0.101)   (0.101)              (3.226) 
## alcohol x volatile.acidity x sulphates                                          -0.036*     1.524* 
##                                                                                 (0.015)    (0.593) 
## alcohol x volatile.acidity                                                                 -0.996* 
##                                                                                            (0.389) 
## alcohol x sulphates                                                                        -0.184  
##                                                                                            (0.309) 
## volatile.acidity x sulphates                                                              -15.622* 
##                                                                                            (6.130) 
## ---------------------------------------------------------------------------------------------------
## R-squared                                   0.227     0.317     0.336     0.336     0.003     0.351
## adj. R-squared                              0.226     0.316     0.335     0.335     0.003     0.349
## sigma                                       0.710     0.668     0.659     0.659     0.806     0.652
## F                                         468.267   370.379   268.912   268.912     5.414   123.160
## p                                           0.000     0.000     0.000     0.000     0.020     0.000
## Log-likelihood                          -1721.057 -1621.814 -1599.384 -1599.384 -1923.929 -1580.453
## Deviance                                  805.870   711.796   692.105   692.105  1038.644   675.909
## AIC                                      3448.114  3251.628  3208.768  3208.768  3853.857  3178.905
## BIC                                      3464.245  3273.136  3235.654  3235.654  3869.988  3227.300
## N                                        1599      1599      1599      1599      1599      1599    
## ===================================================================================================
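
The formulas in the table above can be refit with base R's `lm()`. The sketch below skips m4 because the table lists it with the same formula and results as m3. The helper names `fit_quality_models` and `adj_r2` are hypothetical, and `df` is assumed to be the wine data frame.

```r
# Fit the linear models from the comparison table on a data frame that has
# quality, alcohol, volatile.acidity, and sulphates columns.
fit_quality_models <- function(df) {
  formulas <- list(
    m1 = quality ~ alcohol,
    m2 = quality ~ alcohol + volatile.acidity,
    m3 = quality ~ alcohol + volatile.acidity + sulphates,
    m5 = quality ~ alcohol:volatile.acidity:sulphates,  # three-way product only
    m6 = quality ~ alcohol * volatile.acidity * sulphates  # all main effects and interactions
  )
  lapply(formulas, function(f) lm(f, data = df))
}

# Adjusted R-squared for each fitted model, as a named numeric vector
adj_r2 <- function(models) {
  sapply(models, function(m) summary(m)$adj.r.squared)
}

# Usage (assumes `reds` exists): adj_r2(fit_quality_models(reds))
```

The `*` operator expands to every main effect and interaction, which is why m6 has eight coefficients while m5, which uses only `:`, has two.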

Since model 6 had the highest R^2 value, I tested it with some obvious extreme cases (based on what seems to have been discovered above) using only values for alcohol, volatile.acidity, and sulphates:

  1. A fake good red was created using a high alcohol content, low volatile.acidity, and average (median) sulphates level and is expected to have a high quality score.
fake good red quality score prediction (95% prediction interval)
##       fit    lwr      upr
## 1 7.43747 6.1107 8.764241
  2. A fake mid red was created using an average (median) alcohol content, average (median) volatile.acidity, and average (median) sulphates level and is expected to have a mid-range quality score.
fake mid-range red quality score prediction (95% prediction interval)
##        fit      lwr      upr
## 1 5.538351 4.259407 6.817296
  3. A fake bad red was created using a low alcohol content, high volatile.acidity, and low sulphates level and is expected to have a low quality score.
fake bad red quality score prediction (95% prediction interval)
##        fit      lwr      upr
## 1 4.845372 3.271417 6.419327
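
The `fit`/`lwr`/`upr` columns above are what `predict()` returns for an `lm` when an interval is requested. The intervals are roughly ±1.3 wide, about twice model 6's sigma of 0.652, which suggests `interval = "prediction"`; that setting is an assumption in the sketch below. The helper name `predict_quality` is hypothetical.

```r
# Predict the quality score of a single made-up wine from a fitted lm.
# Returns a one-row matrix with columns fit, lwr, and upr.
predict_quality <- function(model, alcohol, volatile.acidity, sulphates,
                            level = 0.95) {
  new_wine <- data.frame(alcohol = alcohol,
                         volatile.acidity = volatile.acidity,
                         sulphates = sulphates)
  # interval = "prediction" is assumed, based on the interval widths above
  predict(model, newdata = new_wine, interval = "prediction", level = level)
}

# Usage for the mid-range red, using the medians from the summary table
# (assumes a fitted model named m6 exists):
# predict_quality(m6, alcohol = 10.20, volatile.acidity = 0.52, sulphates = 0.62)
```

The exact values used for the fake good and fake bad reds are not listed in this write-up, so only the mid-range case is shown.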

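As might be expected from the investigation so far, the predictions land in the right order: about 7.4 for the good red, 5.5 for the mid-range red, and 4.8 for the bad red. Keep in mind, though, that model 6 explains only about 35% of the variance in quality (R^2 = 0.351) and that each interval is more than two quality points wide, so the model gets the ranking right rather than pinning down exact scores.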

Strength of this model: it works for the obvious cases. Weakness of this model: it’s unclear how robust it is.


Final Plots and Summary

Plot One

alcohol vs volatile.acidity, colored by alcohol, sized by sulphates, faceted by quality

Description One

This plot makes it easy to see the distribution of quality scores (most are in the middle), the rightward trend of alcohol content, the downward slope of volatile.acidity, and the mid-range sweet-spot of sulphates levels all in relation to quality scoring.

Plot Two

quality vs alcohol, colored by volatile.acidity, sized by sulphates, with a smoothing line

Description Two

Although alcohol and quality are swapped from their perhaps expected axis locations, the swapping, along with the smoothing line, makes it clear that as quality increases, so does alcohol content (and vice versa). Volatile.acidity and sulphates continue to play supporting roles.

Plot Three

alcohol vs volatile.acidity with only clusters colored by quality

Description Three

Perhaps my favorite plot, borrowing from the experimentation with contours earlier, this plot, although leaving out sulphates, makes it clear that there are distinct clusters of quality scores that are quite obviously related to volatile.acidity and alcohol levels. If I were given a new red wine with only those two features listed, I would be very confident using merely this plot to predict the quality score (assuming the same wine experts responsible for this data set).


Reflection

A fruitful exercise, this project exposed two or three features of red wines that, when related to one another, seem to lead to obvious groupings. Alcohol, volatile.acidity, and sulphates (in that order) appear to affect the (perceived) quality of red wines, at least among those wine experts consulted in the making of this data set.